Probability and Statistics: The Science of Uncertainty: Defining Relationships via Conditional Distributions

This section moves beyond the simple intuition of "trend lines" to a rigorous distributional framework. Here, a relationship is defined not by a correlation coefficient alone, but as any change in the probability distribution of a response variable $Y$ as the predictor $X$ varies.

Definition 10.1.1: The Statistical Bond

Two variables $X$ and $Y$ are considered related if there is any change in the conditional distribution of $Y$, given $X = x$, as $x$ changes. Conversely, a state of "no relationship" is mathematically equivalent to the independence of $X$ and $Y$.

Logical Equivalence

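Variables $X$ and $Y$ are unrelated if and only if $f(y|x) = f(y)$ for all values of $x$ and $y$, where $f(y|x)$ is the conditional density (or relative frequency function) of $Y$ given $X = x$ and $f(y)$ is the marginal of $Y$. Equivalently, the joint function factors into the product of the marginals: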

$$f(x, y) = f(x)f(y)$$

Therefore, testing for a relationship is fundamentally a test of independence.
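As a minimal sketch of the factorization check, the following Python snippet tests whether a small discrete joint distribution equals the product of its marginals. The function name `is_independent` and both probability tables are hypothetical, chosen only for illustration; exact fractions are used so that the equality test is not affected by rounding.

```python
from fractions import Fraction as F

def is_independent(joint):
    """Return True if f(x, y) == f(x) * f(y) for every cell of the table.

    joint maps (x, y) pairs to probabilities that sum to 1.
    """
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    # Marginal distributions of X and Y.
    fx = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    fy = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == fx[x] * fy[y] for x in xs for y in ys)

# Hypothetical table that factors: f(y | x) is the same for x = 1 and x = 2.
unrelated = {(1, 0): F(1, 10), (1, 1): F(3, 10),
             (2, 0): F(3, 20), (2, 1): F(9, 20)}

# Hypothetical table that does not factor: f(y | x) changes with x.
related = {(1, 0): F(1, 4), (1, 1): F(1, 4),
           (2, 0): F(1, 10), (2, 1): F(4, 10)}

print(is_independent(unrelated))
print(is_independent(related))
```

With observed data rather than known probabilities, exact equality will almost never hold, which is why a formal test of independence is needed instead of a direct comparison.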

Mechanisms of Change

A relationship is identified by any shift in the conditional density function (as shown in Figure 10.1.1). This includes:

Mean Shift: The conditional mean $E(Y|X = x)$ changes with $x$ (the most common focus).
Variance Shift: The spread or uncertainty of $Y$ depends on $x$ (heteroscedasticity).
Shape Change: The overall distribution transforms (e.g., from symmetric to skewed).
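A variance shift with no mean shift can be illustrated by simulation. The sketch below assumes a hypothetical model in which $Y$ given $X = x$ is $N(0, x^2)$, with the second parameter read as the variance; the model, sample size, and seed are illustrative choices, not taken from the text.

```python
import random
import statistics

def sample_y(x, n, rng):
    """Draw n values of Y given X = x, assuming Y | X = x ~ N(0, x**2)."""
    # random.gauss takes the standard deviation, so pass x rather than x**2.
    return [rng.gauss(0.0, float(x)) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is reproducible
summary = {}
for x in (1, 2):
    ys = sample_y(x, 20_000, rng)
    summary[x] = (statistics.fmean(ys), statistics.variance(ys))

for x, (mean, var) in summary.items():
    print(f"x={x}: sample mean={mean:.3f}, sample variance={var:.3f}")
```

The sample means should both be close to 0 while the sample variances differ clearly, so under the definition above $X$ and $Y$ are related even though the conditional mean never moves.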

Establishing Causality through Design

A statistical relationship does not imply causality. To claim that $X$ causes $Y$, we must account for confounding variables through the Design of Experiments:

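Control Treatments: Provide a baseline against which the treatment of interest is compared.
Placebo Effect: Subjects may improve simply because they believe they are being treated; giving the control group an inactive treatment (a placebo) accounts for this.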
Blinding: Using blind experiments (recipients unaware) and double-blind experiments (recipients and researchers unaware) to eliminate bias.
Blocking: As seen in Example 10.1.7, we use blocking variables ($W$, like soil fertility) to ensure the relationship between wheat type ($X$) and yield ($Y$) is not confounded by pre-existing conditions.
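One common way to put blocking into practice is to randomize treatments separately within each block. The sketch below assumes that approach; the function `randomize_within_blocks`, the plot labels, the fertility levels, and the wheat types "A" and "B" are all hypothetical names introduced for illustration.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=0):
    """Randomly assign treatments to units separately inside each block.

    blocks maps a block label (e.g. a fertility level) to a list of units.
    Returns a dict mapping each unit to (block label, assigned treatment).
    """
    rng = random.Random(seed)
    assignment = {}
    for block, units in blocks.items():
        # Repeat the treatment list enough times to cover the block.
        reps = -(-len(units) // len(treatments))  # ceiling division
        labels = (list(treatments) * reps)[:len(units)]
        rng.shuffle(labels)  # random order within this block only
        for unit, label in zip(units, labels):
            assignment[unit] = (block, label)
    return assignment

# Hypothetical field: eight plots grouped by soil fertility W.
blocks = {"low": ["p1", "p2", "p3", "p4"], "high": ["p5", "p6", "p7", "p8"]}
plan = randomize_within_blocks(blocks, ["A", "B"])
for plot, (fertility, wheat) in sorted(plan.items()):
    print(plot, fertility, wheat)
```

Because each wheat type appears equally often at every fertility level, a difference in yield between types cannot be explained by fertility alone.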

Core Mathematical Estimation

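We estimate these relationships using conditional likelihood functions. For discrete data, let $f_{ij}$ be the number of observations with $X = i$ and $Y = j$, and let $\theta_{j|X=i} = P(Y = j | X = i)$. The conditional likelihood is: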

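$$L = \prod_{i=1}^a \prod_{j=1}^b (\theta_{j|X=i})^{f_{ij}}$$

Standard Error: for the estimated joint cell probability $\hat{\theta}_{ij} = f_{ij}/n$, where $n$ is the total sample size,

$$SE = \sqrt{\frac{\hat{\theta}_{ij}(1 - \hat{\theta}_{ij})}{n}}$$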
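Maximizing the conditional likelihood row by row gives the estimates $\hat{\theta}_{j|X=i} = f_{ij}/f_{i\cdot}$, where $f_{i\cdot}$ is the total count in row $i$. The sketch below computes these estimates for a hypothetical $2 \times 2$ table of counts. Note the assumption made here: the standard error for a conditional estimate uses the usual binomial formula with the row total $f_{i\cdot}$ in the denominator, in place of the overall $n$ that applies to joint cell probabilities.

```python
import math

def conditional_estimates(counts):
    """Estimate P(Y = j | X = i) from a table of counts.

    counts[i][j] is the number of observations with X = i and Y = j.
    Returns (theta_hat, se), two tables with the same shape as counts.
    """
    theta_hat, se = [], []
    for row in counts:
        n_i = sum(row)  # row total f_i.
        p = [f / n_i for f in row]
        theta_hat.append(p)
        # Binomial standard error based on the row total (assumption).
        se.append([math.sqrt(q * (1 - q) / n_i) for q in p])
    return theta_hat, se

# Hypothetical counts: rows are values of X, columns are values of Y.
counts = [[30, 70],
          [60, 40]]
theta_hat, se = conditional_estimates(counts)
for i, (p_row, se_row) in enumerate(zip(theta_hat, se), start=1):
    print(f"X={i}:", [f"{p:.2f} (SE {s:.3f})" for p, s in zip(p_row, se_row)])
```

If the estimated conditional distributions in different rows differ by much more than their standard errors, that is informal evidence that the distribution of $Y$ changes with $X$.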

QUESTION 1

According to Definition 10.1.1, what must happen for $X$ and $Y$ to be considered related?

The correlation coefficient between $X$ and $Y$ must be exactly 1.

The conditional distribution of $Y$ given $X=x$ must change in some way as $x$ changes.

$X$ and $Y$ must have a functional relationship $Y = g(X)$ where $g$ is linear.

$X$ and $Y$ must be independent.

QUESTION 2

Suppose $Y$ has conditional distribution given $X$ specified by $N(1 + 2x, |x|)$ when $X = x$. Are $X$ and $Y$ related?

Yes, because the mean ($1+2x$) and variance ($|x|$) both change as $x$ changes.

No, because $N$ is always a normal distribution.

Only if $x$ is a positive integer.

No, because they are independent.

QUESTION 3

In a clinical trial, what is the purpose of a 'double-blind' experiment?

To ensure the sample size is doubled to improve the power of the test.

To prevent both the subjects and the researchers from knowing who received the treatment or placebo.

To make sure that only two different dosages are tested.

To satisfy the requirements of a multinomial likelihood function.

QUESTION 4

Why is the functional approach $Y = g(X)$ often insufficient for practical statistical applications?

Because math functions cannot be used in statistics.

Because real-world relationships involve stochastic uncertainty or unobserved factors that $g(x)$ doesn't capture.

Because $g(X)$ always requires $X$ to be a categorical variable.

Because likelihood functions only work for independent variables.

QUESTION 5

Suppose $X$ takes values 1 and 2, and the conditional distributions of $Y$ given $X$ are $N(0, 5)$ when $X = 1$, and $N(0, 7)$ when $X = 2$. Do $X$ and $Y$ have a relationship?

No, because the mean is 0 in both cases.

Yes, because the variance (the spread) of $Y$ changes from 5 to 7.

No, because a relationship requires a change in the expected value.

Only if $Y$ is a discrete variable.